In this section, we apply dimensionality reduction techniques to gain insight into public transit data for major US cities. The dataset comes from the American Public Transit Association (APTA) Ridership Report, which contains details about public transit ridership in 2022.¹
While this dataset does contain some information on ridership volume, unsupervised learning is generally performed without a target variable or known relationships within the data. The features of this dataset are therefore city population, city area (square miles), average cost per trip (dollars), average fare per trip (dollars), and average miles per trip, where each record corresponds to an individual city. In practice, all of these features could factor into understanding the health of a public transit system, as each provides information on the city itself, the conditions for riders, or the cost to the city. The objective of dimensionality reduction here is to discover how these features relate to and interact with one another.
Two common methods for dimensionality reduction are Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE); both will be applied to this dataset. We will be using the following Python libraries to accomplish this:
numpy for obtaining eigenvalues and eigenvectors
sklearn for implementing PCA and TSNE
matplotlib and seaborn for visualizations
Implementation
Dimensionality Reduction with PCA
Code
```python
import pandas as pd

cities = pd.read_csv('../data/cleaned_data/apta_cities_cleaned.csv')
cities = cities.drop(columns=['Unnamed: 0'])
cities.head()
```
|   | City                | Population | Area   | Cost_per_trip | Fare_per_trip | Miles_per_trip |
|---|---------------------|------------|--------|---------------|---------------|----------------|
| 0 | Seattle--Tacoma, WA | 3544011    | 982.52 | 13.906032     | 1.570667      | 5.786344       |
| 1 | Spokane, WA         | 447279     | 171.67 | 13.433827     | 0.988308      | 4.772569       |
| 2 | Yakima, WA          | 133145     | 55.77  | 19.720093     | 1.112531      | 5.179168       |
| 3 | Eugene, OR          | 270179     | 73.49  | 10.851494     | 2.753356      | 3.684118       |
| 4 | Portland, OR--WA    | 2104238    | 519.30 | 10.804361     | 1.025659      | 4.011388       |
A crucial first step for PCA is obtaining the eigenvalues and eigenvectors of the covariance matrix to understand the properties of the feature matrix. The output of this process is printed below.
Code
```python
import numpy as np
from numpy import linalg as LA
import matplotlib.pyplot as plt

# Build the numeric feature matrix and summarize its properties
X = cities.drop(columns=['City']).to_numpy()
print('NUMERIC MEAN:\n', np.mean(X, axis=0))
print("X SHAPE", X.shape)
print("NUMERIC COV:")
print(np.cov(X.T))

# Eigendecomposition of the covariance matrix
w, v1 = LA.eig(np.cov(X.T))
print("\nCOV EIGENVALUES:", w)
print("COV EIGENVECTORS (across rows):")
print(v1.T)
```
NUMERIC MEAN:
[7.63817374e+05 2.54954371e+02 1.62164796e+01 1.69764181e+00
6.02033451e+00]
X SHAPE (286, 5)
NUMERIC COV:
[[ 3.00348049e+12 6.21976504e+08 -2.68509287e+06 -2.80714130e+05
-3.74964286e+05]
[ 6.21976504e+08 1.60662563e+05 -7.09356971e+02 -8.34065149e+01
-1.18548286e+02]
[-2.68509287e+06 -7.09356971e+02 1.12521769e+02 1.14344019e+01
1.54309551e+01]
[-2.80714130e+05 -8.34065149e+01 1.14344019e+01 1.07182685e+01
6.17463373e+00]
[-3.74964286e+05 -1.18548286e+02 1.54309551e+01 6.17463373e+00
2.61947556e+01]]
COV EIGENVALUES: [3.00348062e+12 3.18612161e+04 1.13357226e+02 2.46043548e+01
8.18652852e+00]
COV EIGENVECTORS (across rows):
[[-9.99999979e-01 -2.07085246e-04 8.93993769e-07 9.34629441e-08
1.24843257e-07]
[ 2.07087147e-04 -9.99987172e-01 4.82944608e-03 7.95476612e-04
1.28713329e-03]
[-1.36812764e-07 5.03927972e-03 9.77725251e-01 1.15731143e-01
1.75026404e-01]
[ 1.31093957e-07 -4.57011124e-04 1.99629231e-01 -2.56090918e-01
-9.45814677e-01]
[ 2.27827260e-08 -9.92273370e-05 6.46388484e-02 -9.59699490e-01
2.73493506e-01]]
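As a sanity check on the decomposition printed above, the defining eigenpair property can be verified directly. This is a minimal sketch; a random matrix of the same shape stands in for the cities feature matrix, so the numbers differ from the output above.

```python
# Check the eigendecomposition property used above: for a covariance
# matrix C, each eigenpair satisfies C @ v = w * v. A synthetic (286, 5)
# matrix is a placeholder for the cities features.
import numpy as np
from numpy import linalg as LA

rng = np.random.default_rng(0)
X = rng.normal(size=(286, 5))   # placeholder for the cities feature matrix
C = np.cov(X.T)                 # 5x5 covariance matrix

w, v = LA.eig(C)
for i in range(len(w)):
    # each eigenvector is scaled by its eigenvalue under C
    assert np.allclose(C @ v[:, i], w[i] * v[:, i])

# A symmetric covariance matrix also factors as C = V diag(w) V^T
assert np.allclose(v @ np.diag(w) @ v.T, C)
```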
Upon obtaining the properties of our dataset, we can now use sklearn to implement PCA. The steps are as follows:
Normalize the feature matrix using the StandardScaler() function
Sort the eigenvectors in decreasing order of their eigenvalues to prioritize principal components
Take the cumulative sum of the explained variance to find the proportion of the variance that can be explained by each number of components
Plot this cumulative distribution
Note: Some of this code is repurposed from https://www.geeksforgeeks.org/reduce-data-dimentionality-using-pca-python/ ²
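The steps above can be sketched as follows. A random matrix stands in for the cities feature matrix, so the resulting curve is illustrative only; sklearn's PCA handles the eigenvector sorting internally and exposes the variance proportions as explained_variance_ratio_.

```python
# Sketch of the standardize -> PCA -> cumulative-variance steps.
# Synthetic data is a placeholder for the (286, 5) cities features.
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(286, 5))                 # placeholder feature matrix

X_scaled = StandardScaler().fit_transform(X)  # step 1: normalize

pca = PCA()                                   # steps 2-3: fit all components,
pca.fit(X_scaled)                             # sorted by explained variance
cum_var = np.cumsum(pca.explained_variance_ratio_)

plt.plot(range(1, 6), cum_var, marker='o')    # step 4: plot the curve
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.xlim(1, 5)
plt.show()
```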
The cumulative explained variance plot shows that more than 95% of the variance is covered by 4 components, so it is reasonable to select that as the number of principal components. A good way to check the efficacy of this reduction is to plot the covariance with seaborn before and after PCA. The covariance heatmap of the original features clearly shows a significant amount of covariance between a few of the variables.
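Such a before-PCA heatmap can be produced as in the following sketch; the column names match the cities dataset, but random placeholder values stand in for the real features.

```python
# Sketch of a feature-correlation heatmap with seaborn. The column
# names are from the cities dataset; the values are placeholders.
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend for this sketch
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
cols = ['Population', 'Area', 'Cost_per_trip', 'Fare_per_trip', 'Miles_per_trip']
cities_features = pd.DataFrame(rng.normal(size=(286, 5)), columns=cols)

# correlation equals covariance of the standardized features
corr = cities_features.corr()
sns.heatmap(corr, annot=True, fmt='.2f')
plt.show()
```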
```python
from sklearn.decomposition import PCA

pca = PCA(n_components=4)
pca.fit(X)
data_pca = pca.transform(X)
data_pca = pd.DataFrame(data_pca, columns=['PC1', 'PC2', 'PC3', 'PC4'])
data_pca.head()
```
|   | PC1       | PC2       | PC3       | PC4       |
|---|-----------|-----------|-----------|-----------|
| 0 | 2.195331  | 1.041785  | -0.110497 | -0.016923 |
| 1 | -0.031348 | -0.494236 | 0.066595  | 0.048293  |
| 2 | -0.552852 | -0.343962 | -0.319175 | -0.122077 |
| 3 | -0.252383 | -0.531766 | 0.230998  | 0.663955  |
| 4 | 1.206063  | -0.041013 | 0.042951  | 0.169983  |
After applying PCA, the heatmap below shows the covariance between principal components and greatly highlights the usefulness of this process: there is essentially no covariance between principal components, indicating that the 4-component selection was effective in summarizing the data.
Code
```python
import seaborn as sns

sns.heatmap(data_pca.corr())
```
Finally, below is an interactive 3D plot to visualize the data after selecting principal components.
Code
```python
import matplotlib.pyplot as plt
%matplotlib widget

fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # Axes3D(fig) no longer attaches to the figure
ax.scatter(data_pca['PC2'], data_pca['PC3'], data_pca['PC4'], c=data_pca['PC1'])
ax.set_title("3D Plot of Principal Components")
ax.set_xlabel('PC2')
ax.set_ylabel('PC3')
ax.set_zlabel('PC4')
plt.show()
```
Dimensionality Reduction with t-SNE
For implementing t-SNE, we will once again use sklearn. The TSNE() function is unfortunately limited to at most three components, so it will mainly be used for parameter tuning: analyzing different perplexities and how they affect our visualizations. The results of a couple of these runs are below:
Code
```python
from sklearn.manifold import TSNE

X_embedded = TSNE(n_components=3, learning_rate='auto',
                  init='random', perplexity=1).fit_transform(X)

# EXPLORE RESULTS
print("RESULTS")
print("shape : ", X_embedded.shape)
print("First few points : \n", X_embedded[0:4, :])

# PLOT
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], alpha=0.5)
plt.show()
```
Ultimately, for this application, PCA proved to be the more useful process for understanding relationships within the feature matrix of our data. In general, PCA is ideal for preserving global variance in the data, while t-SNE preserves local neighborhood relationships more effectively. The crucial difference between the two is that PCA is a linear technique while t-SNE is non-linear. For a dataset like this one, where the ordering of data points is not a factor, the features are separable from one another, and the initial dimensionality is quite low, PCA is likely to be more effective.
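The linearity claim above can be checked directly: sklearn's PCA transform is just centering followed by a matrix product with the component vectors, whereas t-SNE has no such closed-form map. A minimal sketch, with synthetic data in place of the cities features:

```python
# PCA is a linear map: transform(X) equals centering followed by a
# projection onto the component vectors. Placeholder data is used here.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(286, 5))   # placeholder for the cities feature matrix

pca = PCA(n_components=4).fit(X)
manual = (X - pca.mean_) @ pca.components_.T   # explicit linear projection

# matches sklearn's own transform (no whitening)
assert np.allclose(manual, pca.transform(X))
```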
Footnotes
“Raw monthly ridership (no adjustments or estimates),” Raw Monthly Ridership (No Adjustments or Estimates) | FTA, https://www.transit.dot.gov/ntd/data-product/monthly-module-raw-data-release (accessed Nov. 14, 2023).↩︎
“Reduce data dimensionality using PCA - Python,” GeeksforGeeks, https://www.geeksforgeeks.org/reduce-data-dimentionality-using-pca-python/ (accessed Nov. 14, 2023).↩︎